1 Weighting strategies for lectometric analyses

1.1 The dataset: selection of concepts

The analyzed concepts were all selected from the lexical variables generated by the enhanced Clustering-by-Committee algorithm (henceforth CBC 2.0). To qualify for selection, a lexical variable had to match exactly (i.e. in composition and in size) a Dutch WordNet synset incorporated in Cornetto 1.3 (the release available in R2d2). We can distinguish four subgroups in this set of concepts, based on 1) the potential presence of polysemy in one of its constituent lexicalisations and 2) the number of lexicalisations:

  • 10 concepts with 3 near-synonyms, one of which appears in more than 1 synset and is therefore considered potentially polysemous. These concepts are: safety belt, ecstasy, wedding, display window, apartment building, chat, religion, appreciation, stress, individuality. They were chosen from among the 16 exact matches with the aforementioned properties, additionally avoiding extremely frequent concepts (more than 10000 tokens before disambiguation).

  • 10 concepts with 2 near-synonyms, one of which appears in more than 1 synset and is therefore considered potentially polysemous. These concepts are: reflection, demolition, keyboard, dip, repayment, stronghold, nail, sex, palm tree, pitch. As the exact matches with 2 near-synonyms constitute a sizeable set in the CBC output (450), we selected concepts such that A) their total frequency would be lower than 1500, B) the set would cover a large range of external uniformity values (in practice from 0.5 to 0.9) and C) their polysemy structure was expected to emerge from the token-based vector spaces.

  • 11 concepts with 3 near-synonyms, none of which appears in more than 1 synset. We therefore do not expect any of these variants to show traditionally defined polysemy. These concepts are: peace talks, submarine, retaliation, town center, pity, allegation, tour operator, filmmaker, suspicion, draw. To these initial 10 concepts, chosen from among the 16 exact matches with the aforementioned properties, we added an 11th one, namely convict.

  • 11 concepts with 2 near-synonyms, none of which appears in more than 1 synset. These concepts are: commuter, opening hours, lack of space, electoral district, cash, scriptwriter, voyage of discovery, amusement park, database, nuclear reactor. To these initial 10 concepts we added an 11th one, namely living room. As the exact matches with 2 near-synonyms constitute a sizeable set in the CBC output (450), we selected concepts such that A) their total frequency would be lower than 1500, B) the set would cover the full spectrum of external uniformity values (in practice from 0.1 to 0.9) and C) their polysemy structure was expected to emerge from the token-based vector spaces.

1.2 Weighting and filtering of concepts (with or without cluster analysis)

Different concepts can contribute in different ways to the aggregated lectal distances between the national varieties of Dutch, that is, Belgian Dutch and Netherlandic Dutch. In the Geeraerts, Grondelaers, & Speelman (1999) monograph, the procedure for determining a differential contribution of concepts consisted of three steps:

  • first, to establish whether the lectal profiles for a given concept are significantly different,
  • second, to include such differences in the aggregate calculation over concepts when they are significant, and to treat the non-significant cases as exhibiting 100% uniformity or, equivalently, 0 distance (in other words, the significance test is precisely that: a test of the hypothesis that the observed differences represent ‘real’ differences),
  • third, when aggregating over different concepts, to weight the uniformity measure of a given concept by its relative frequency (schematically summarised below).
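
Schematically (a sketch in our own notation, not that of the 1999 monograph), with $U_c$ the uniformity measure of concept $c$, $f_c$ its frequency and $p_c$ the p-value of its significance test, the aggregate uniformity becomes:

$$
U \;=\; \frac{\sum_{c} f_c \, U'_c}{\sum_{c} f_c},
\qquad
U'_c \;=\;
\begin{cases}
U_c & \text{if } p_c < \alpha\\
1 & \text{otherwise.}
\end{cases}
$$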

The original 1999 method, and its subsequent refinements, would apply to concepts in their entirety, often after manual semantic disambiguation of the tokens had been carried out on the concept under scrutiny. Applying this same procedure to clusters within a concept would mean the following:

  • first, to establish whether the cluster-based lexical profiles of a given concept are significantly different;
  • second, to treat non-significant clusters as exhibiting 100% uniformity or equivalently 0 distance and to treat significantly different clusters as exhibiting ‘real’ differences;
  • third, to aggregate over the uniformity measures of the different clusters, weighted by the relative frequency of the clusters (see the sketch after this list).
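
The sketch below (our own Python illustration, not the authors' implementation) spells out this cluster-level procedure: the distance of a non-significant cluster is set to 0, and the cluster distances are then averaged with weights proportional to the clusters' token frequencies.

```python
def aggregate_concept_distance(clusters, alpha=0.05):
    """Aggregate cluster-level City Block distances into a concept-level
    distance. `clusters` is a list of (distance, p_value, n_tokens) tuples;
    clusters whose profiles do not differ significantly contribute 0."""
    total = sum(n for _, _, n in clusters)
    return sum((d if p < alpha else 0.0) * n / total for d, p, n in clusters)

# Hypothetical example with three clusters of one concept:
# aggregate_concept_distance([(0.206, 0.03, 142), (0.138, 0.21, 172),
#                             (0.266, 0.001, 212)])
```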

The first and second parts of this method thus essentially rely on statistical significance as a filter mechanism. When we approach the comparison of profiles within the statistical framework of hypothesis testing, we can consider the frequencies in a profile to be a sample from a random variable with a multinomial distribution. The null hypothesis of such a test states that both samples are drawn from the same population: the distribution of the synonyms does not differ between the national varieties, and, based on that, the inference would be that the national varieties cannot be considered different linguistic varieties. The alternative hypothesis claims that the onomasiological profiles do differ, and that we are dealing with separate linguistic varieties. A statistical test can then calculate a probability (i.e. the p-value) that quantifies how likely it is to find the observed differences in the distribution of near-synonyms, given that the null hypothesis of no lectal differences holds. One rejects the null hypothesis and accepts the alternative hypothesis when this probability is lower than the customary significance level α = 0.05, i.e. the threshold below which we have decided to reject the null hypothesis. The strength of the statistical testing approach is that it is sensitive to how much evidence there is for the assumption that there actually is an underlying difference between the two profiles, so that we can avoid overrating distances that are not based on a substantial number of tokens.
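
To make the compared objects concrete, the following sketch (our own illustration) turns raw counts into onomasiological profiles and computes the City Block distance between them; halving the summed absolute differences makes the distance the complement of the uniformity measure, and reproduces the values reported in the tables below (the counts are those of cluster 2 of ‘safety_belt’ in section 1.4).

```python
counts_be = {"riem": 60, "gordel": 35, "veiligheid_gordel": 0}
counts_nl = {"riem": 20, "gordel": 27, "veiligheid_gordel": 0}

def profile(counts):
    """Onomasiological profile: relative frequencies of the near-synonyms."""
    total = sum(counts.values())
    return {w: f / total for w, f in counts.items()}

def city_block(p, q):
    """Halved City Block distance between two profiles; the halving makes it
    the complement (1 - U) of the uniformity measure."""
    return 0.5 * sum(abs(p[w] - q[w]) for w in p)

print(round(city_block(profile(counts_be), profile(counts_nl)), 3))  # 0.206
```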

Statistical significance can also be employed in two other ways: as a weighting measure or as a distance metric in itself. In the first case, instead of using the p-value as a cut-off point for filtering, one could weight (i.e. multiply) the uniformity or distance by 1 − p, so that significantly different onomasiological profiles have a larger impact on the aggregate lectal distance between Netherlandic Dutch and Belgian Dutch. This approach was suggested in Speelman, Grondelaers, & Geeraerts (2003: 366) and explored in Ruette (2012: 93); the influence of this type of weighting was nevertheless found to be minimal. In the second case, the p-value obtained from the statistical significance test would be used directly as a distance metric. This option was investigated in Speelman, Grondelaers, & Geeraerts (2003). In this report we will limit ourselves to using statistical significance as a filter mechanism.
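
As a minimal sketch of the weighting variant (our own illustration):

```python
def significance_weighted_distance(dist, p_value):
    """Weight the distance by 1 - p instead of filtering at a fixed alpha
    (cf. Speelman, Grondelaers & Geeraerts 2003: 366; Ruette 2012: 93)."""
    return dist * (1.0 - p_value)
```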

Two statistical tests will be used in this respect: the Log Likelihood Ratio test and Fisher’s Exact Test. If the resulting p-value of such a test is lower than the customary significance level α = 0.05, the null hypothesis of no difference between the profiles is rejected, and the observed City Block distance can safely be assumed to reflect a real difference between the regiolects. If the p-value is higher than the α-level, the City Block distance is set to 0 (no lectal distance), which corresponds to accepting the null hypothesis of no difference between the onomasiological profiles. Although the Log Likelihood Ratio test, and hypothesis testing in general, is precisely meant to avoid taking for granted lectal distances based on low-frequency profiles, it is a test that can still suffer from sparse data. Fisher’s Exact Test is known to suffer less from sparseness in the cells of the contingency table, and should therefore be even more reliable than the LLR test for low-frequency profiles.
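
Both filters can be sketched with SciPy (our own illustration; the counts are again those of cluster 2 of ‘safety_belt’, with the all-zero row dropped, since expected frequencies of 0 are not allowed). Note that SciPy’s fisher_exact only handles 2×2 tables; for profiles with more than two variants an exact test requires another implementation (e.g. R’s fisher.test or a Monte Carlo approximation).

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

table = np.array([[60, 20],    # riem:   BE, NL  (cluster 2 of 'safety_belt',
                  [35, 27]])   # gordel: BE, NL   all-zero row dropped)

# G-test (Log Likelihood Ratio test); correction=False disables the Yates
# continuity correction that chi2_contingency applies to 2x2 tables.
g, p_llr, dof, _ = chi2_contingency(table, correction=False,
                                    lambda_="log-likelihood")
_, p_fisher = fisher_exact(table)

def filtered_distance(dist, p_value, alpha=0.05):
    """Keep the City Block distance only if the profiles differ significantly."""
    return dist if p_value < alpha else 0.0

# e.g. filtered_distance(0.206, p_llr); whether a borderline cluster passes
# can differ between the two tests (and with the handling of empty rows),
# as the LLR and Fisher columns in section 1.4 illustrate.
```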

The weakness of this approach is that statistical significance can be manipulated by simply increasing the sample size of one or both onomasiological profiles in the comparison. Being dependent on sample size, the test will exaggerate the weight of frequent concepts, regardless of the lectal distance (in other words, the effect size), and downplay the weight of infrequent concepts (this criticism was already raised by Ruette (2012: 93)).

To overcome this weakness, we can rely on statistical power as an (additional) filter mechanism. The statistical power of one of the binary hypothesis tests introduced above indicates, loosely speaking, the confidence we have that, with the available data, we are able to find a ‘real’ difference between profiles, given the presence of such a ‘real’ difference (i.e. given that the alternative hypothesis is in fact true). The power of a test is directly related to the Type II error probability β. The β-level of a test indicates the probability (we can tolerate) of failing to detect a true difference, that is, of retaining the false null hypothesis while the alternative hypothesis is true. The corresponding statistical power of a test is 1 − β, and ranges from 0 to 1. Keeping the Type II error probability low means increasing the power of a study. Just as with the significance level α (which is the probability we can tolerate of falsely rejecting the true null hypothesis and accepting the false alternative hypothesis, that is, the Type I error), a threshold can be chosen in advance by the researcher; the minimal power of a test is usually set to 0.80. The computation of statistical power depends on a combination of the significance level (α-level), the sample size (i.e. the total frequency of the compared profiles) and the effect size (i.e. the lectal distance, quantified as Cramér’s V).

It is indeed possible to observe a significant difference in profiles between regiolects (where the p-value is lower than the significance level α), but that significance might be due either to the sample size (the larger the sample size, the easier it is to obtain a significant effect, regardless of the magnitude of the effect) or to a genuine, true difference between the regiolects. A power analysis is often conducted prior to the hypothesis test, in order to calculate the minimum sample size required for observing an effect of a given size, but power can also be estimated post hoc, after a test has been conducted. Using statistical power as a filter mechanism then boils down to a procedure similar to the one based on statistical significance: if 1 − β is lower than 0.80, meaning there is insufficient power to accept the alternative hypothesis, the City Block distance is set to 0 (no lectal distance), which corresponds to accepting the null hypothesis of no difference between the onomasiological profiles.
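
Under the assumption (ours) that power is computed for the chi-square/G-test of independence on a k × 2 profile table, in which case Cohen’s effect size w coincides with Cramér’s V, the post-hoc power and the 0.80 filter can be sketched as follows:

```python
from scipy.stats import chi2, ncx2

def posthoc_power(cramers_v, n, df, alpha=0.05):
    """Post-hoc power of a chi-square test on a k x 2 table (df = k - 1),
    via the noncentral chi-square distribution with noncentrality N * V^2."""
    ncp = n * cramers_v ** 2           # noncentrality parameter
    crit = chi2.ppf(1 - alpha, df)     # critical value under H0
    return 1.0 - ncx2.cdf(crit, df, ncp)

def power_filtered_distance(dist, power, threshold=0.80):
    """Zero the City Block distance when the test is underpowered."""
    return dist if power >= threshold else 0.0
```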

WARNING! It is questionable whether this observed (post-hoc) power can be a useful additional filter mechanism, or whether it is simply an unsound concept. I leave it up here for discussion!

1.3 Baseline scenario: lectal distances without semantic control

Before we venture into the calculation of lectal distances based on subregions/clusters within the semantic space defined by a concept’s lexicalizations, it might be instructive to look at lectal distances generated without any regard for the semantic structure of the lexemes involved. Lectal distances based on all retrieved tokens of the lexemes can then act as a useful reference point against which we can compare lectal distances obtained with more attention to the semantic reality behind those tokens.

These calculations are evidently flawed: no correction or selection of the tokens has been carried out whatsoever. Although they are of rather limited descriptive use, they can showcase how applying the different filters (significance based on the LLR test, on Fisher’s Exact Test, or on power) yields different outcomes, and how the real analyses might look.

The main finding resulting from the table above is that for the vast majority of concepts (36 out of 42, to be precise), the application of filters does not affect the lectal distances. For almost all concepts the hypothesis tests, be it the LLR test or Fisher’s Exact Test, report significantly different distributions; therefore most City Block distances are kept. The concepts for which the tests cannot reject the null hypothesis of no lectal difference, and which thus receive a distance of 0, are nuclear reactor, suspicion and religion. As these are the concepts with the lowest lectal distances in the dataset, we can assume that the null hypothesis of no lectal difference could not be rejected because there effectively does not seem to be a difference between the profiles.

The filter based on statistical power sets the distance of another three concepts to 0: pitch, retaliation and ecstasy. These are the concepts with the next three lowest lectal distances in the dataset. It is important to note that the total concept frequency is 1000 for all concepts, since we set a maximum threshold during their retrieval (justified by the purpose of the Frontiers in AI article). The third step in the lectometric procedure, that is, the relative weighting of the concepts, is therefore superfluous for this dataset. Given the fixed values for the total sample size (i.e. 1000) and the significance level, the power of the test seems to depend solely on the effect size (i.e. Cramér’s V), apart from the degrees of freedom, which differ between concepts with 2 and with 3 near-synonyms.
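
This dependence can be made explicit with a quick check (our own illustration, assuming the chi-square power computation sketched in section 1.2 and df = 2 for concepts with three variants): at fixed N and α, power increases monotonically with Cramér’s V.

```python
from scipy.stats import chi2, ncx2

n, alpha, df = 1000, 0.05, 2        # df = 2 assumes three variants
crit = chi2.ppf(1 - alpha, df)      # critical value under H0
for v in (0.094, 0.095, 0.097):     # Cramér's V values discussed below
    print(v, round(1 - ncx2.cdf(crit, df, n * v ** 2), 3))
```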

It is surprising to see that the concept convict keeps its City Block distance after the power filter, even though this distance is lower than the lectal distance of ecstasy. Its Cramér’s V value is slightly higher than that of ecstasy (0.095 vs. 0.094), but lower than that of retaliation (0.097). If power were a strictly monotone function of Cramér’s V at equal sample size and degrees of freedom, this pattern would be impossible, which suggests that one of those quantities (or the computation itself) differs across these concepts. It is not very clear what is happening here!

1.4 safety belt

Definitions of ‘safety_belt’, by cluster. Each contingency table gives the Belgian (BE) and Netherlandic (NL) token counts; the City Block columns give the distance without filtering and after the LLR, Fisher and power filters.

Cluster 1
                      BE   NL
  riem                53   15
  gordel               0    0
  veiligheid_gordel    0    0
  City Block: no filter 0.000 | LLR filter 0.000 | Fisher filter 0.000 | power filter 0.000
  Semantics: riem as ‘oar/paddle’, more in the idiom ‘roeien met de riemen die men heeft’

Cluster 2
                      BE   NL
  riem                60   20
  gordel              35   27
  veiligheid_gordel    0    0
  City Block: no filter 0.206 | LLR filter 0.000 | Fisher filter 0.206 | power filter 0.000
  Semantics:
    • riem as ‘shoulder/waistbelt’, more specifically in the idiom ‘een hart onder de riem steken’
    • gordel as ‘belt’, more specifically in the idioms ‘een slag/stoot onder de gordel’

Cluster 3
                      BE   NL
  riem                 3    1
  gordel              80   19
  veiligheid_gordel   61    8
  City Block: no filter 0.138 | LLR filter 0.000 | Fisher filter 0.000 | power filter 0.000
  Semantics: ‘safety belt’

Cluster 4
                      BE   NL
  riem                70   92
  gordel              39    9
  veiligheid_gordel    1    1
  City Block: no filter 0.266 | LLR filter 0.266 | Fisher filter 0.266 | power filter 0.266
  Semantics:
    • gordel as ‘waistbelt’, often as ‘judo/karate belt’
    • riem as ‘waistbelt’, often in fashion context (with leren)

Cluster 5
                      BE   NL
  riem                80   28
  gordel             180   69
  veiligheid_gordel   36   13
  City Block: no filter 0.019 | LLR filter 0.000 | Fisher filter 0.000 | power filter 0.000
  Semantics:
    • gordel as ‘encircling area’ (e.g. groene gordel) and as ‘safety belt’
    • riem as ‘waistbelt’, more specifically in the idiom ‘de riem afleggen’ and in fashion context